Long-Tail Distributions and Unsupervised Learning of Morphology

Authors

  • Qiuye Zhao
  • Mitch Marcus
Abstract

In previous work on unsupervised learning of morphology, the long-tail pattern in the rank-frequency distribution of words, as well as of morphological units, is usually taken to follow Zipf’s law (a power law). We argue that these long-tail distributions can also be treated as lognormal. Since the conjugate prior distribution for a lognormal likelihood is known, we propose to generate morphology data from lognormal distributions. When performance is evaluated by a token-based criterion, which gives more weight to the results for frequent words, the proposed model performs significantly better than the other models discussed. Moreover, we capture the statistical properties of morphological units with a Bayesian approach, rather than a rule-based approach as studied in (Chan, 2008) and (Zhao and Marcus, 2011). Given the multiplicative property of lognormal distributions, we can directly capture the long-tail distribution of word frequency, without needing an additional generative process as studied in (Goldwater et al., 2006).
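
The multiplicative property mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' model; it only checks the general fact that a product of independent lognormal variables is again lognormal, with log-mean equal to the sum of the individual log-means and log-variance equal to the sum of the individual log-variances. The function name `log_product_stats` and the parameter values are hypothetical, chosen purely for illustration.

```python
import math
import random
import statistics


def log_product_stats(params, n_samples=20000, seed=0):
    """Sample products of independent lognormal variables.

    params is a list of (mu, sigma) pairs, one per factor (illustrative
    values only, not taken from the paper). Returns the sample mean and
    population variance of log(product).
    """
    rng = random.Random(seed)
    logs = []
    for _ in range(n_samples):
        product = 1.0
        for mu, sigma in params:
            # lognormvariate(mu, sigma): the log of the draw is Normal(mu, sigma)
            product *= rng.lognormvariate(mu, sigma)
        logs.append(math.log(product))
    return statistics.fmean(logs), statistics.pvariance(logs)


# Two hypothetical factors, e.g. one per morphological unit of a word.
params = [(0.5, 0.4), (1.0, 0.3)]
mean_log, var_log = log_product_stats(params)

# Theory: log-means add, log-variances add.
expected_mean = sum(mu for mu, _ in params)
expected_var = sum(sigma ** 2 for _, sigma in params)

print(f"mean of log(product): {mean_log:.3f} (theory {expected_mean:.3f})")
print(f"var of log(product):  {var_log:.3f} (theory {expected_var:.3f})")
```

With enough samples the empirical values should sit close to the theoretical ones, which is why a product of lognormally distributed factors can itself be treated as lognormal without a separate generative step.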

Similar resources

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique when a large amount of labeled data with similar attributes is available in different domains. In real-world applications, there is a huge amount of data, but most of it is unlabeled. Domain adaptation is effective in image classification, where it is expensive and time-consuming to obtain adequate labeled data. We propose a novel method named DALRRL, which consists of deep ...

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

The Tail Mean-Variance Model and Extended Efficient Frontier

In portfolio theory, it is well known that the distributions of stock returns often have non-Gaussian characteristics. Therefore, we need non-symmetric distributions for modeling and accurate analysis of actuarial data. For this purpose and for optimal portfolio selection, we use the Tail Mean-Variance (TMV) model, which focuses on rare risks with high losses that usually occur in the tail of r...

Fixing the Infix: Unsupervised Discovery of Root-and-Pattern Morphology

We present an unsupervised and language-agnostic method for learning root-and-pattern morphology in Semitic languages. We harness the syntactico-semantic information in distributed word representations to solve the long-standing problem of root-and-pattern discovery in Semitic languages. The root-and-pattern morphological rules we learn in an unsupervised manner are validated by native speakers i...

Range Distributions of Low-energy Nitrogen and Oxygen Ions in Silicon (RESEARCH NOTE)

The range distributions of low-energy nitrogen and oxygen (2-3 keV) ions in silicon are measured and compared with those available from theory. The nitrogen distribution is very close to a Gaussian distribution, as predicted by theory. The oxygen profile, however, indicates a surface-localized peak along with a shoulder and a long tail into the sample. The surface peak is believed to be the resul...

Publication date: 2012